This lesson is written in R in the so-called R Markdown format. It is assumed that you have R and RStudio installed. If so, you will be able to follow all steps by running the code in the grey boxes below. For further information on getting R and RStudio, see the Prerequisites section of the book R for Data Science.
This lesson is the first concrete example of how to interact with a specific API, and we pick up exactly where we left off in the previous lesson, What is an API?. The last thing we did in that lesson was to ask the Royal Danish Library’s Newspaper API how many articles mention “internet”. The answer was returned in JSON format, which we will save for later, since the Newspaper API can also return answers in CSV format, which is what we use in this example. CSV is short for Comma Separated Values and is a way of storing data in a raw text format. CSV files are easily handled by most programming languages, and especially by R. The main focus of this lesson will therefore be on constructing a request URL to the Newspaper API, as explained in the previous lesson.
As a general rule of thumb, it is always best to examine and understand the data you are trying to extract, the service that stores them, and how that service makes them available before you dive into the API. This process depends entirely on the specific case, and in our case with the Newspaper API it involves looking into what the collection contains. The following section gives a very short survey of the Danish Newspaper Collection’s history in order to understand the data it holds.
The collections exist because legal deposit of published material has been required by law in Denmark since 1697. As a result, Danish newspapers have been collected and stored for the future. This led to a lot of physical paper, so the library began to photograph the individual pages of each newspaper and store them on microfilm instead. Then, from 2014 to 2017, these microfilms were digitized. This involved a computer running a segmentation algorithm, which went through all the now digital pages and identified which headers belonged to which paragraphs, thus forming articles. Along with this, the computer also recognized the text, making it searchable. The process of recognizing the text is called Optical Character Recognition (OCR). These processes were not precise, especially for the older newspapers, which caused a lot of “misreadings” in the OCR text and in the segmentation of articles. The result is an ALTO file, which is short for Analyzed Layout and Text Object. This is a highly structured data format that stores information on where the individual OCR-recognised words are placed on the page as well as which article they belong to. The best way to imagine an ALTO file is as a file that contains the digital layout as recognised by segmentation and OCR. The combination of the ALTO file and the digital photograph of the newspaper page forms a PDF file that consists of two “layers”: one is the actual picture of the newspaper page, and the other contains the OCR text, which makes the PDF file searchable.
Visualization of the digitization process of the newspapers - in the segmentation and OCR, the colors indicate which text parts have been identified as belonging together
The result is of course a lot of PDF files, but there is also a lot of metadata around these files. For example, we have the time of publication, the place of publication and which newspaper it is. All this data is presented and made available through a graphical user interface that ordinary users can interact with. In the case of the newspaper collection, this platform is called Mediestream.
Let’s use the graphical user interface on a specific case. We want to find articles from the correspondent sent out by the newspaper “Dagbladet”. These articles should be about internal affairs in France and in Paris and about the politician Charles de Rémusat in the year 1873. The screenshot below shows how this search is performed in Mediestream. Red circles mark the elements of the interface that are of particular interest:
Example search: free text search, specification of newspaper, and a defined time range in the selector tool of the graphical user interface
The top circle is the free text search field. This is where we define that the words “korrespondent”, “paris” and “rémusat” must be present in the OCR text of the article that we are looking for. The next circle is where we define the time period of interest, in this case by pointing and clicking through months and years until the range runs from 1 January 1873 to 31 December 1873 - in other words, the entire year of 1873. The last circle is where we have defined that we are only interested in hits in the newspaper “Dagbladet”. This results in 9 hits, which means that 9 articles (identified as such in the segmentation process) meet our requirements.
This exact search could have been performed entirely from the free text search field using more advanced search codes. Behold this search:
This gives exactly the same result: 9 hits from the newspaper “Dagbladet”. So what has been done differently? Notice the free text search field - here we have appended “py:1873” to our earlier search. This is an “advanced” search code setting the publication year (py) to 1873. Notice how the time selector is blank - this is because it hasn’t been used. Furthermore, the search code “familyId:dagbladetkoebenhavn1851” has been added, which says that we are only interested in results from the newspaper “Dagbladet”. Since “Dagbladet” is a fairly popular name for a newspaper (imagine something like “Daily News”), we are using a unique id for this particular newspaper. All the newspapers in Mediestream have been given unique ids to avoid ambiguity. Thus we end up with a search string that looks like this:
korrespondent AND paris AND rémusat AND py:1873 AND familyId:dagbladetkoebenhavn1851
In order to extract raw data from the Newspaper API, we need to be able to define the data that we are interested in with this kind of advanced search string. It is a good idea to test the search strings in Mediestream, and once you are happy with the number of hits you take your advanced search string to the API. For more help on constructing search strings, see the page for search advice in Mediestream, where you’ll also find a link to a list of the aforementioned unique ids for the newspapers.
One important thing to add before moving on is the limitation on access to the newspaper collection due to copyright. The newspapers are at the library because of the legal deposit of published material, so some of the material is still under copyright. This means that you can only see newspapers older than 100 years, and in order to extract data from the Newspaper API the material must be older than 140 years.
Before moving on to extracting data from the Newspaper API with a search string, let’s create a string that has more than 9 hits by expanding the time range and removing rémusat, in order to get articles containing paris and korrespondent in the period 1870 to 1875:
korrespondent AND paris AND py:[1870 TO 1875] AND familyId:dagbladetkoebenhavn1851
This search gives us 644 hits. Now we have a somewhat large body of material, and we want to apply some kind of digital method to it. This can’t be done in the graphical user interface of Mediestream. We need to turn our focus to the API connected to Mediestream.
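Search strings like the ones above can also be put together in R instead of being typed by hand. The following is a minimal sketch in base R. The helper name `build_query` and its arguments are made up for this illustration and are not part of Mediestream or the API; the helper only glues the parts together with AND and does not check that the result is a valid search.

```r
# Hypothetical helper: joins free-text terms and search codes with AND.
# The function name and arguments are assumptions made for this example.
build_query <- function(terms, year_from, year_to, family_id) {
  # A single year is written py:1873, a range is written py:[1870 TO 1875]
  year_part <- if (year_from == year_to) {
    paste0("py:", year_from)
  } else {
    paste0("py:[", year_from, " TO ", year_to, "]")
  }
  paste(c(terms, year_part, paste0("familyId:", family_id)),
        collapse = " AND ")
}

query <- build_query(
  terms     = c("korrespondent", "paris"),
  year_from = 1870,
  year_to   = 1875,
  family_id = "dagbladetkoebenhavn1851"
)
print(query)
```

Printing `query` reproduces the search string for the period 1870 to 1875 shown above, which can then be tested in Mediestream before it is taken to the API.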
In order to extract the 644 hits as raw data in a machine-readable format, we use the Swagger interface for the Newspaper API. A Swagger interface is an interactive documentation of an API. This means that you can both try the API’s functionality and get information about which metadata and data are exported. Furthermore, the interface shows how you can limit your search. The existence of a Swagger interface (or similar) is a good sign for data extraction, because it means that the creators have thought about communicating the API’s functionality.
Navigating to the Newspaper API Swagger UI will lead you to the following landing page: What we see in the blue boxes is all the different services that the API offers, each with a text summarizing what the service does. These are called the endpoints of the API. In this case we will focus on the first service, described in the top blue box, the endpoint:
/aviser/export/fields - Export data from old newspapers at http://mediestream.dk/
Clicking on this box expands the view:
Expanding the /aviser/export/fields endpoint
The next step is to paste the search string from before into the query field, replacing the placeholder search:
The dataset is processed in the software R, offering various methods for statistical analysis and graphic representation of the results. In R, one works with packages each adding numerous functionalities to the core functions of R. In this example, the relevant packages are:
Documentation for each package:
https://www.tidyverse.org/packages/
https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
https://lubridate.tidyverse.org/
https://ggplot2.tidyverse.org/
https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html
Additional information about R: https://www.r-project.org/
library(tidyverse)
library(tidytext)
library(lubridate)
library(ggwordcloud)
The dataset is loaded into R. This is done via a retrieval link from
the API. This link is created by the Swagger UI, which is both the
documentation and the user interface for the API. Here we have
specified that we want newspaper data from the St. Croix Avis from
the year 1878. This data is loaded into R with the read_csv function,
since we have also specified the data format to be CSV in the
Swagger UI:
croix <- read_csv("http://labs.statsbiblioteket.dk/labsapi/api/aviser/export/fields?query=py%3A1878%20AND%20familyId%3Astcroixavisdvi&fields=link&fields=recordID&fields=timestamp&fields=pwa&fields=cer&fields=fulltext_org&fields=pageUUID&fields=editionUUID&fields=titleUUID&fields=editionId&fields=familyId&fields=newspaper_page&fields=newspaper_edition&fields=lplace&fields=location_name&fields=location_coordinates&max=5000&structure=header&structure=content&format=CSV")
## Rows: 2831 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): link, recordID, fulltext_org, pageUUID, editionUUID, titleUUID, e...
## dbl (4): pwa, cer, newspaper_page, newspaper_edition
## dttm (1): timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
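The retrieval link does not have to be copied from the Swagger UI; it can also be put together in R. The sketch below rebuilds the link used above from its parts with base R's `URLencode()`. The variable names are illustrative, and the parameter names (`query`, `fields`, `max`, `structure`, `format`) are simply read off the link above rather than taken from the API documentation.

```r
# Parts of the request, read off the retrieval link used above
base_url <- "http://labs.statsbiblioteket.dk/labsapi/api/aviser/export/fields"
query    <- "py:1878 AND familyId:stcroixavisdvi"
fields   <- c("link", "recordID", "timestamp", "pwa", "cer", "fulltext_org",
              "pageUUID", "editionUUID", "titleUUID", "editionId", "familyId",
              "newspaper_page", "newspaper_edition", "lplace",
              "location_name", "location_coordinates")

# reserved = TRUE also encodes ":" and spaces, as in the link above
encoded_query <- utils::URLencode(query, reserved = TRUE)

# Glue the pieces together; each field becomes its own fields= parameter
request_url <- paste0(
  base_url,
  "?query=", encoded_query,
  "&", paste0("fields=", fields, collapse = "&"),
  "&max=5000",
  "&structure=header&structure=content",
  "&format=CSV"
)
print(request_url)

# The assembled link could then be passed to read_csv(), for example:
# croix <- read_csv(request_url)
```

Changing `query` to another search string that has been tested in Mediestream gives a new link without a trip through the Swagger UI.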
CSV is short for Comma Separated Values, which is a way of structuring
a dataset in plain text. CSV files are structured in columns separated
by commas and in rows separated by line breaks. Each row in the data
corresponds to an article identified by the segmentation process
during the digitisation of the newspapers.
In the output from the read_csv function, R tells us which
columns are present in the dataset and what type of data it has
recognised in each column. Most of them are “col_character()”
(shown as chr in the output), which means the rows in the column
contain textual data (characters). Others are “col_double()”
(shown as dbl), which means the rows in the column contain numbers.
This is a question of data types, which can be very important when
coding, but in this workshop we won’t work further with them.
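To see how column types are guessed without downloading anything, the sketch below reads a tiny, made-up CSV text with read_csv. The three rows and their values are invented for illustration; only the column names are borrowed from the dataset above. It assumes a readr version that accepts literal data wrapped in `I()` (readr 2.0 or later).

```r
library(readr)

# Invented sample data: column names from the dataset above, values made up
csv_text <- "recordID,pwa,newspaper_page
example-1,85.5,1
example-2,90.25,2
example-3,78,3
"

# I() tells read_csv() that the string is literal data, not a file path
sample_rows <- read_csv(I(csv_text), show_col_types = FALSE)

# Full column specification, as suggested in the read_csv() message
spec(sample_rows)

# Class of each column: text columns are "character", number columns "numeric"
sapply(sample_rows, class)
```

The text column is read as character and the two number columns as double, which mirrors the chr and dbl entries in the output above.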